Posts with tag Data Munging

Back to all posts

String distance measurements are useful for cleaning up the sort of messy data from multiple sources.

There are a bunch of string distance algorithms, which usually rely on some form of calculations about the similarities of characters. But in real life, characters are rarely the relevant units: you want a distance measure that penalized changes to the most information-laden parts of the text more heavily than to the parts that are filler.